from IPython.display import Image
Image('/home/nikolay/Documents/Medium/DeepLearningMetagenomics/images/hmp.png', width = 2000)
The idea of estimating microbial composition for a given sample is inspired by the popular QIIME tool called SourceTracker. Originally, SourceTracker was built on 16S data, i.e. using only 16S ribosomal RNA genes. In contrast, here we will try to train a classifier using shotgun metagenomics data. In addition, SourceTracker was not designed to be used on raw metagenomics sequences, but instead it needs microbial abundabces (OTU abundances) quantified in some way, e.g. through the QIIME pipeline, MetaPhlan or Kraken. In contrast, the goal of the CNN classifier we will train here is to go from a fastq-file to predicting microbial composition of a metagenomic sample.
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/DeepLearningMetagenomics/images/qiime_plus_sourcetracker.png', width = 2000)
The details of the SourceTracker algorithm can be found in the original SourceTracker publication. It is a Bayesian version of Gaussian Mixture Model (GMM) clustering algorithm (seems similar to the Direchlet Multinomial Mixtures) that is trained / fit on a reference data set called Sources (i.e. different classes such as Soil or Human Oral or Human Gut), and can estimate proportion / contribution of each Source into test samples called Sinks, therefore the SourceTracker algorithm explicitly models a Sink sample as a mixture of Sources. The Bayesian flavour of the algorithm comes from using Direchlet priors when fitting the model.
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/DeepLearningMetagenomics/images/SourceTracker.png', width = 2000)
In the original SourceTracker publication, the authors demonstrate that the Bayesian incorporation of uncertainties into the prediction procedure allows to outperform other analogous Machine Learning algorithms such as Naive Bayes and random Forest.
As a data source we will use the results of the Human Microbiome Project (HMP).
Human Microbiome Project (HMP) provides a fantastic resource for the reference of human microbiome. The project accumulated large amount of data from different human "tissues", or rather microbial communities such as Gut, Skin, Oral etc. communities. We will use the table of microbial abundances produced by Metaphlan2 for different tissues from here https://www.hmpdacc.org/hmsmcp2/, let us read and have a look at the data:
import numpy as np
import pandas as pd
#If you would like to start working directly with bacteria on genus level please uncomment the lines below
#microb = microb[[i for i in list(microb.columns) if not 's__' in i and 'g__' in i and 'k__Bacteria' in i]]
#print(microb.shape)
microb = pd.read_csv('/home/nikolay/Documents/Medium/DeepLearningMetagenomics/microb.txt', sep = '\t')
microb
The microbial abundances were measured for 2437 microbes at different taxonomic levels, and 2335 samples coming from different human tissues: Oral, Gut, Skin and Vaginal microbiomes. Let us display phenotypic variables available for the HMP data. There are not some any useful phenotypes, but perhaps of particular interest are STArea that means the tissues of origin, Gender and STSite that stands for the more specific method or site of the body where the microbes originate from. Here we will also build a "Color" variable where Gut samples are going to be "blue", Oral samples "red", Skin samples "green" and Vaginal samples "orange".
import warnings
warnings.filterwarnings("ignore")
phen = pd.read_csv('/home/nikolay/Documents/Medium/DeepLearningMetagenomics/phen.txt', sep = '\t')
phen['Color'] = 'blue'
phen['Color'][phen['STArea'] == 'Oral'] = 'red'
phen['Color'][phen['STArea'] == 'Skin'] = 'green'
phen['Color'][phen['STArea'] == 'Vaginal'] = 'orange'
phen
Let us display the numbers of samples available from the Oral, Gut, Skin and Vaginal microbial communities. It looks like the Oral microbiome samples are doiminating, and are at least twice as many as of Gut, Skin and Vaginal samples. However, overall the dataset does not look crazy imbalanced which is good for training a CNN classifier.
phen['STArea'].value_counts()
phen['Gender'].value_counts()
phen['STSite'].value_counts()
Above we can see that the fecal samples (Stool) represent the majority of HMP samples, this should correspond to the Gut tissue. Another thing is that the HMP samples seem to be pretty balanced with respect to Gender, i.e. nearly as many Female samples as Male ones. Next, we are going to visualize the samples using dimensionality reduction techniques such as Principal Component Analysis (PCA) and tSNE, we are also going to color the samples by their tissues of origin.
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
sns.set(font_scale = 1.5)
import numpy as np
from umap.umap_ import UMAP
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
X = microb.values
X = np.log10(X + 1)
plt.figure(figsize = (20, 15))
pca = PCA(n_components = 20).fit(X)
index = np.arange(len(pca.explained_variance_ratio_))
plt.bar(index, pca.explained_variance_ratio_)
plt.title('Variance Explained by Principal Components', fontsize = 20)
plt.xlabel("Principal Component", fontsize = 20)
plt.ylabel("Variance Explained", fontsize = 20)
plt.show()
plt.figure(figsize = (20, 15))
X_reduced = PCA(n_components = 2).fit_transform(X)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], s = 50, c = phen['Color'])
plt.title('Principal Component Analysis (PCA): Human Microbiome Project (HMP)', fontsize = 25)
plt.xlabel("PCA1", fontsize = 22); plt.ylabel("PCA2", fontsize = 22)
from matplotlib import cm
import matplotlib.patches as mpatches
my_legends = [mpatches.Patch(color = 'blue', label = 'Gut'),
mpatches.Patch(color = 'red', label = 'Oral'),
mpatches.Patch(color = 'green', label = 'Skin'),
mpatches.Patch(color = 'orange', label = 'Vagina')]
plt.legend(handles = my_legends, fontsize = 20)
plt.show()
plt.figure(figsize = (20, 15))
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = 50,
init = X_reduced, n_iter = 1000, verbose = 2)
tsne = model.fit_transform(X)
plt.scatter(tsne[:, 0], tsne[:, 1], s = 50, c = phen['Color'])
plt.title('tSNE: Human Microbiome Project (HMP)', fontsize = 25)
plt.xlabel("tSNE1", fontsize = 22); plt.ylabel("tSNE2", fontsize = 22)
from matplotlib import cm
import matplotlib.patches as mpatches
my_legends = [mpatches.Patch(color = 'blue', label = 'Gut'),
mpatches.Patch(color = 'red', label = 'Oral'),
mpatches.Patch(color = 'green', label = 'Skin'),
mpatches.Patch(color = 'orange', label = 'Vagina')]
plt.legend(handles = my_legends, fontsize = 20)
plt.show()
In both PCA and tSNE plots, we can see that the samples are clearly separable by the tissue of origin in terms of microbial abundances. This implies that it should be quite straighforward to train a CNN classifier that can predict the tissue of origin for a given sample by checking microbial DNA sequences in the sample. However, for making the PCA and tSNE plots we used microbial abundances and not sequencing data itself, therefore we will have to first extract microbial organisms that separate Gut, Oral, Skin and Vaginal samples the most. Reference genomes from these most discriminative microbial organisms will be used later for creating a sequence train data set for the CNN classifier.
For selecting microbial organisms that separate the Gut, Oral, Skin and Vaginal samples the most, we could have trained for example a Random Forest classifier and looked at the feature importances and their effect directions. However, here it looks like there are plenty of tissue-specific microbes, i.e. microbes that have high abundance in one tissue but almost zero in all other tissues. For example, if we look at Haemophilus parahaemolyticus that has an index 1950 in the microbial abundance data frame, it looks very Oral-specific:
[print(i) for i in microb.columns if 'Haemophilus_parahaemolyticus' in i]
microb.columns.get_loc('k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pasteurellales' + \
'|f__Pasteurellaceae|g__Haemophilus|s__Haemophilus_parahaemolyticus')
ind = 1950
groupped_data = pd.DataFrame({'Tissue': list(['Oral']*phen['STArea'].value_counts()['Oral'] +
['Skin']*phen['STArea'].value_counts()['Skin'] +
['Vaginal']*phen['STArea'].value_counts()['Vaginal'] +
['Gut']*phen['STArea'].value_counts()['Gut']),
'Abundance': list(microb.loc[phen['ID'][phen['STArea'] == 'Oral'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Skin'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Vaginal'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Gut'],].iloc[:, ind])})
#data_boxplot['Abundance'] = np.log10(data_boxplot['Abundance'] + 1)
groupped_data
groupped_data.groupby('Tissue').mean()
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
sns.set(font_scale = 2)
plt.figure(figsize = (20, 15))
sns.barplot(x = "Tissue", y = "Abundance", data = groupped_data)
plt.title('Haemophilus parahaemolyticus')
plt.show()
Let us visualize abundances of a few microbial organisms, we can see that some of them look very tissue-specific, i.e. they are present almost excusively only in one tissue while nearly absent in all other tissues. A few examples are Bifidobacterium longum which seems to be very gut-specific and reminds us about yogurts. Neisseria gonorrhoeae is a very oral-specific, Staphylococcus aureus is a known skin-specific bacteria, and Gardnerella vaginalis is a common marker of vaginal microbiota.
import matplotlib.pyplot as plt
plt.figure(figsize = (20,15))
sns.set(font_scale = 1.5)
sample_microb_names = ['Bifidobacterium longum', 'Neisseria gonorrhoeae',
'Staphylococcus aureus', 'Gardnerella vaginalis']
for i, ind in enumerate([284, 1759, 786, 292]):
plt.subplot(2, 2, i + 1)
groupped_data = pd.DataFrame({'Tissue': list(['Oral']*phen['STArea'].value_counts()['Oral'] +
['Skin']*phen['STArea'].value_counts()['Skin'] +
['Vaginal']*phen['STArea'].value_counts()['Vaginal'] +
['Gut']*phen['STArea'].value_counts()['Gut']),
'Abundance': list(microb.loc[phen['ID'][phen['STArea'] == 'Oral'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Skin'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Vaginal'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Gut'],].iloc[:, ind])})
sns.barplot(x = "Tissue", y = "Abundance", data = groupped_data)
plt.title(sample_microb_names[i])
plt.show()
Let us display a few more interesting bacteria that often pop up in metagenomics projects, it is good to understand what tissue they belong to:
import matplotlib.pyplot as plt
plt.figure(figsize = (20, 20))
sample_microb_names = ['Campylobacter showae', 'Helicobacter pylori', 'Neisseria meningitidis',
'Parvimonas micra', 'Rothia dentocariosa', 'Burkholderia cenocepacia',
'Bordetella genus', 'Pseudomonas stutzeri', 'Variovorax paradoxus',
'Mycobacterium smegmatis', 'Mycobacterium abscessus', 'Clostridium perfringens']
for i, ind in enumerate([1821, 1832, 1765, 1100, 214, 1654, 1645, 2016, 1697, 230, 226, 1055]):
plt.subplot(4, 3, i + 1)
groupped_data = pd.DataFrame({'Tissue': list(['Oral']*phen['STArea'].value_counts()['Oral'] +
['Skin']*phen['STArea'].value_counts()['Skin'] +
['Vaginal']*phen['STArea'].value_counts()['Vaginal'] +
['Gut']*phen['STArea'].value_counts()['Gut']),
'Abundance': list(microb.loc[phen['ID'][phen['STArea'] == 'Oral'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Skin'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Vaginal'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Gut'],].iloc[:, ind])})
sns.barplot(x = "Tissue", y = "Abundance", data = groupped_data)
plt.xlabel("")
plt.ylabel("")
plt.title(sample_microb_names[i])
plt.show()
There are a few genera that have species that are tissue-specific in different tissues. For example, Streptococcus genus seems to be very much oral-specific. However, according to my experience, one can find Streptococcus species in other tissues such as gut. Indeed, here there are some examples of species from Streptococcus genus that are gut and oral specific:
import matplotlib.pyplot as plt
plt.figure(figsize = (20,15))
sns.set(font_scale = 1.5)
sample_microb_names = ['Streptococcus suis', 'Streptococcus mutans',
'Streptococcus equi', 'Streptococcus infantarius']
for i, ind in enumerate([1009, 957, 937, 943]):
plt.subplot(2, 2, i + 1)
groupped_data = pd.DataFrame({'Tissue': list(['Oral']*phen['STArea'].value_counts()['Oral'] +
['Skin']*phen['STArea'].value_counts()['Skin'] +
['Vaginal']*phen['STArea'].value_counts()['Vaginal'] +
['Gut']*phen['STArea'].value_counts()['Gut']),
'Abundance': list(microb.loc[phen['ID'][phen['STArea'] == 'Oral'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Skin'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Vaginal'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Gut'],].iloc[:, ind])})
sns.barplot(x = "Tissue", y = "Abundance", data = groupped_data)
plt.title(sample_microb_names[i])
plt.show()
In the downstream analysis, for simplicity, we are going to concentrate on bacterial microbes only, i.e. archeae and viruses are going to be ignored, and only bacterial organisms classified only on genus level in the HMP data set will be kept for further downstream analysis.
We will build lists of tissue-specific bacterial genera and use their names for grepping bacterial reference genomes from the RefSeq NCBI database. Hence, here we subset the total matrix of microbial abundances down to becterial genera abundances.
import numpy as np
import pandas as pd
#Here I attempted to filter out low abundant microbes using the following R codes:
#low.count.removal = function(data, # OTU count data frame of size n (sample) x p (OTU)
# percent=0.001 # cutoff chosen)
# {
# keep.otu = which(colSums(data)*100/(sum(colSums(data))) > percent)
# data.filter = data[,keep.otu]
# return(list(data.filter = data.filter, keep.otu = keep.otu))
# }
#microb_clean<-low.count.removal(microb)$data.filter
#The microbe_clean.txt file contained microbes present in at least 0.1% of samples
#microb = pd.read_csv('/home/nikolay/Documents/Medium/DeepLearningMetagenomics/microb_clean.txt', sep = '\t')
#Select only bacterial genus level microbial abundances
microb = microb[[i for i in list(microb.columns) if not 's__' in i and 'g__' in i
and 'k__Bacteria' in i and not 'noname' in i and not 'unclassified' in i]]
microb
We conclude that there are 227 bacterial genera present in the HMP project.
Now, we are going to go through all the HMP bacterial genera and build lists of Oral-specific, Skin-specific, Vagina-specific and Gut-specific genera by using a criteriaon that if a bacterial genus is 10 times more abundant in one tissue compared to all other tissues, this genus is considered to be tissue-specific.
oral_specific_list = list(); oral_specific_ratio_list = list()
gut_specific_list = list(); gut_specific_ratio_list = list()
skin_specific_list = list(); skin_specific_ratio_list = list()
vagina_specific_list = list(); vagina_specific_ratio_list = list()
for ind in range(len(microb.columns)):
my_data = pd.DataFrame({'Tissue': list(['Oral']*phen['STArea'].value_counts()['Oral'] +
['Skin']*phen['STArea'].value_counts()['Skin'] +
['Vaginal']*phen['STArea'].value_counts()['Vaginal'] +
['Gut']*phen['STArea'].value_counts()['Gut']),
'Abundance': list(microb.loc[phen['ID'][phen['STArea'] == 'Oral'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Skin'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Vaginal'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Gut'],].iloc[:, ind])})
ratio = float(my_data.groupby('Tissue', sort=False).mean().sort_values(
'Abundance', ascending = True).iloc[3, :] / my_data.groupby('Tissue', sort = False).mean().sort_values(
'Abundance', ascending = True).iloc[2, :])
if( ratio > 10 and 'k__Bacteria' in microb.columns[ind] and 'g__' in microb.columns[ind]):
if(my_data.groupby('Tissue', sort = False).mean().sort_values(
'Abundance', ascending = True).index[3] == 'Oral'):
oral_specific_list.append(microb.columns[ind])
oral_specific_ratio_list.append(ratio)
if(my_data.groupby('Tissue', sort = False).mean().sort_values(
'Abundance', ascending = True).index[3] == 'Skin'):
skin_specific_list.append(microb.columns[ind])
skin_specific_ratio_list.append(ratio)
if(my_data.groupby('Tissue', sort = False).mean().sort_values(
'Abundance', ascending = True).index[3] == 'Vaginal'):
vagina_specific_list.append(microb.columns[ind])
vagina_specific_ratio_list.append(ratio)
if(my_data.groupby('Tissue', sort = False).mean().sort_values(
'Abundance', ascending = True).index[3] == 'Gut'):
gut_specific_list.append(microb.columns[ind])
gut_specific_ratio_list.append(ratio)
if((ind + 1) % 100 == 0):
print('Processed: {} microbes'.format(ind + 1))
Let us see how many vagina-specific bacterial genera we ended up with. In addition, below, we will display the vagina-specific genera:
len(vagina_specific_list)
vagina_specific_list
We seem to have 16 vagina-specific bacterial genera that we can rank by the fold-change, i.e. mean abundance in the tissue of main presence divided by the mean abundance of the second main tissue of presence.
df_vagina_specific = pd.DataFrame({'BACTERIA': vagina_specific_list, 'RATIO': vagina_specific_ratio_list})
df_vagina_specific = df_vagina_specific.sort_values('RATIO', ascending = False)
df_vagina_specific = df_vagina_specific.iloc[0:df_vagina_specific.shape[0],:]
pd.options.display.max_colwidth = 170
df_vagina_specific
In the list of vagina-specific bacterial genera (at least 10 times more abundant in the main tissue compared to any other tissue) above we can immediately see a few "strange" genera such as Burkholderia or Ralstonia. Those are very suspicious since there is some evidence that they might be PCR reagent contaminants. In the future, we will exclude them as well as two very lowly abundant and "noisy" genera: Elizabethkingia and Herbaspirillum. In total, this will give us 12 vagina-specific bacterial genera. What about oral-, gut- and skin-specific genera?
len(oral_specific_list)
oral_specific_list
df_oral_specific = pd.DataFrame({'BACTERIA': oral_specific_list, 'RATIO': oral_specific_ratio_list})
df_oral_specific = df_oral_specific.sort_values('RATIO', ascending = False)
df_oral_specific = df_oral_specific.iloc[0:df_oral_specific.shape[0],:]
pd.options.display.max_colwidth = 170
df_oral_specific
len(skin_specific_list)
skin_specific_list
df_skin_specific = pd.DataFrame({'BACTERIA': skin_specific_list, 'RATIO': skin_specific_ratio_list})
df_skin_specific = df_skin_specific.sort_values('RATIO', ascending = False)
df_skin_specific = df_skin_specific.iloc[0:df_skin_specific.shape[0],:]
pd.options.display.max_colwidth = 170
df_skin_specific
len(gut_specific_list)
gut_specific_list
df_gut_specific = pd.DataFrame({'BACTERIA': gut_specific_list, 'RATIO': gut_specific_ratio_list})
df_gut_specific = df_gut_specific.sort_values('RATIO', ascending = False)
df_gut_specific = df_gut_specific.iloc[0:df_gut_specific.shape[0],:]
pd.options.display.max_colwidth = 170
df_gut_specific
Let us write the lists of tissue-specific bacterial genera down to files.
with open('HMP_oral_specific_bacteria_genus.txt', 'w') as f:
for item in oral_specific_list:
f.write("%s\n" % item)
with open('HMP_gut_specific_bacteria_genus.txt', 'w') as f:
for item in gut_specific_list:
f.write("%s\n" % item)
with open('HMP_skin_specific_bacteria_genus.txt', 'w') as f:
for item in skin_specific_list:
f.write("%s\n" % item)
with open('HMP_vagina_specific_bacteria_genus.txt', 'w') as f:
for item in vagina_specific_list:
f.write("%s\n" % item)
Now, we are going to extract genera names by parsing the lists of the tissue-specific bacteria. We will put all the tissue specific genera names into a list of lists tissue_specific_list and save its 4 elements (4 lists) into separate files.
tissue_specific_list = []
for tissue in [oral_specific_list, gut_specific_list, skin_specific_list, vagina_specific_list]:
parsed_genus_bacteria = list(set([i.split('|')[5] for i in tissue]))
parsed_genus_bacteria = [i.replace('g__','') for i in parsed_genus_bacteria]
tissue_specific_genera = list(set(parsed_genus_bacteria))
print(len(tissue_specific_genera))
tissue_specific_list.append(tissue_specific_genera)
We will also write the tissue-specific genera names down to text files.
for index, tissue in enumerate(['oral','gut','skin','vagina']):
with open('tissue_specific_genera_names_' + tissue + '_full_list.txt', 'w') as f:
for item in tissue_specific_list[index]:
f.write("%s\n" % item)
We can see that human oral environment has 61 (the most) tissue-specefic microbial genera, while vagina has 16 (the least) tissue-specific genera.
To double check that we selected tissue-specific genera, let us plot their microbial abundances and visually instpect that the selected genera are indeed oral-specific, gut-specific, skin-specific and vagina-specific.
import matplotlib.pyplot as plt
plt.figure(figsize = (20, 40))
sample_microb_names = ['Streptococcus', 'Neisseria', 'Veillonella', 'Actinomyces', 'Treponema', 'Haemophilus',
'Rothia', 'Leptotrichia', 'Cardiobacterium', 'Campylobacter', 'Capnocytophaga',
'Eikenella', 'Oribacterium', 'Alloprevotella', 'Fusobacterium', 'Peptostreptococcus',
'Selenomonas', 'Parvimonas', 'Tannerella', 'Porphyromonas', 'Prevotella']
genus_index = [microb.columns.get_loc([i for i in microb.columns if ('g__' + j) in i][0])
for j in sample_microb_names]
for i, ind in enumerate(genus_index):
plt.subplot(7, 3, i + 1)
groupped_data = pd.DataFrame({'Tissue': list(['Oral']*phen['STArea'].value_counts()['Oral'] +
['Skin']*phen['STArea'].value_counts()['Skin'] +
['Vaginal']*phen['STArea'].value_counts()['Vaginal'] +
['Gut']*phen['STArea'].value_counts()['Gut']),
'Abundance': list(microb.loc[phen['ID'][phen['STArea'] == 'Oral'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Skin'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Vaginal'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Gut'],].iloc[:, ind])})
sns.barplot(x = "Tissue", y = "Abundance", data = groupped_data)
plt.xlabel("")
plt.ylabel("")
plt.title(sample_microb_names[i])
plt.show()
import matplotlib.pyplot as plt
plt.figure(figsize = (20, 40))
sample_microb_names = ['Blautia', 'Shigella', 'Faecalibacterium', 'Bacteroides', 'Dorea', 'Akkermansia',
'Clostridium', 'Ruminococcus', 'Subdoligranulum', 'Adlercreutzia', 'Oxalobacter',
'Gordonibacter', 'Anaerostipes', 'Oscillibacter', 'Catenibacterium', 'Cellulophaga',
'Eubacterium', 'Bilophila', 'Bifidobacterium', 'Coprobacillus', 'Desulfovibrio']
genus_index = [microb.columns.get_loc([i for i in microb.columns if ('g__' + j) in i][0])
for j in sample_microb_names]
for i, ind in enumerate(genus_index):
plt.subplot(7, 3, i + 1)
groupped_data = pd.DataFrame({'Tissue': list(['Oral']*phen['STArea'].value_counts()['Oral'] +
['Skin']*phen['STArea'].value_counts()['Skin'] +
['Vaginal']*phen['STArea'].value_counts()['Vaginal'] +
['Gut']*phen['STArea'].value_counts()['Gut']),
'Abundance': list(microb.loc[phen['ID'][phen['STArea'] == 'Oral'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Skin'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Vaginal'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Gut'],].iloc[:, ind])})
sns.barplot(x = "Tissue", y = "Abundance", data = groupped_data)
plt.xlabel("")
plt.ylabel("")
plt.title(sample_microb_names[i])
plt.show()
import matplotlib.pyplot as plt
plt.figure(figsize = (20, 30))
sample_microb_names = ['Staphylococcus', 'Pantoea', 'Myxococcus', 'Peptoniphilus', 'Proteus',
'Rickettsiella', 'Aeromonas', 'Citrobacter', 'Agrobacterium', 'Enhydrobacter',
'Finegoldia', 'Propionibacterium', 'Acinetobacter', 'Massilia', 'Hymenobacter',
'Corynebacterium', 'Bacillus', 'Micrococcus']
genus_index = [microb.columns.get_loc([i for i in microb.columns if ('g__' + j) in i][0])
for j in sample_microb_names]
for i, ind in enumerate(genus_index):
plt.subplot(6, 3, i + 1)
groupped_data = pd.DataFrame({'Tissue': list(['Oral']*phen['STArea'].value_counts()['Oral'] +
['Skin']*phen['STArea'].value_counts()['Skin'] +
['Vaginal']*phen['STArea'].value_counts()['Vaginal'] +
['Gut']*phen['STArea'].value_counts()['Gut']),
'Abundance': list(microb.loc[phen['ID'][phen['STArea'] == 'Oral'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Skin'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Vaginal'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Gut'],].iloc[:, ind])})
sns.barplot(x = "Tissue", y = "Abundance", data = groupped_data)
plt.xlabel("")
plt.ylabel("")
plt.title(sample_microb_names[i])
plt.show()
import matplotlib.pyplot as plt
plt.figure(figsize = (20, 30))
sample_microb_names = ['Marinococcus', 'Mobiluncus', 'Burkholderia', 'Sphingopyxis', 'Rhodospirillum',
'Ureaplasma', 'Caulobacter', 'Elizabethkingia', 'Ralstonia', 'Gardnerella', 'Chlamydia',
'Asticcacaulis', 'Mycobacterium', 'Herbaspirillum', 'Lactobacillus', 'Achromobacter',
'Mycoplasma', 'Atopobium']
genus_index = [microb.columns.get_loc([i for i in microb.columns if ('g__' + j) in i][0])
for j in sample_microb_names]
for i, ind in enumerate(genus_index):
plt.subplot(6, 3, i + 1)
groupped_data = pd.DataFrame({'Tissue': list(['Oral']*phen['STArea'].value_counts()['Oral'] +
['Skin']*phen['STArea'].value_counts()['Skin'] +
['Vaginal']*phen['STArea'].value_counts()['Vaginal'] +
['Gut']*phen['STArea'].value_counts()['Gut']),
'Abundance': list(microb.loc[phen['ID'][phen['STArea'] == 'Oral'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Skin'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Vaginal'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Gut'],].iloc[:, ind])})
sns.barplot(x = "Tissue", y = "Abundance", data = groupped_data)
plt.xlabel("")
plt.ylabel("")
plt.title(sample_microb_names[i])
plt.show()
Let us select 3 the most tissue-specific genera from the 4 tissues:
import matplotlib.pyplot as plt
plt.figure(figsize = (20, 20))
sample_microb_names = ['Neisseria', 'Actinomyces', 'Haemophilus',
'Bacteroides', 'Akkermansia', 'Faecalibacterium',
'Staphylococcus', 'Propionibacterium', 'Peptoniphilus',
'Lactobacillus', 'Gardnerella', 'Ureaplasma']
genus_index = [microb.columns.get_loc([i for i in microb.columns if ('g__' + j) in i][0])
for j in sample_microb_names]
for i, ind in enumerate(genus_index):
plt.subplot(4, 3, i + 1)
groupped_data = pd.DataFrame({'Tissue': list(['Oral']*phen['STArea'].value_counts()['Oral'] +
['Skin']*phen['STArea'].value_counts()['Skin'] +
['Vaginal']*phen['STArea'].value_counts()['Vaginal'] +
['Gut']*phen['STArea'].value_counts()['Gut']),
'Abundance': list(microb.loc[phen['ID'][phen['STArea'] == 'Oral'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Skin'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Vaginal'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Gut'],].iloc[:, ind])})
sns.barplot(x = "Tissue", y = "Abundance", data = groupped_data)
plt.xlabel("")
plt.ylabel("")
plt.title(sample_microb_names[i])
plt.show()
We can clearly see that there are a few very strong tissue-specific bacterial genera markers. Vagina had the least, i.e. 16 tissue-specific genera. However, taking a closer look one could see that it included such genera as Burkholderia, Ralstonia, Bordetella that are suspicious and are very likely to be PCR reagent contaminations. Therefore we carefully examined the list of the 16 vagina-specific bacterial genera and confidently concluded vaginal specificity for only 12 of them due to different reasons. In order not to bias the classification analysis, we will select 12 most tissue-specific markers from all the 4 tissues.
tis_spec_df = pd.DataFrame({'Oral': ['Neisseria', 'Veillonella', 'Actinomyces', 'Haemophilus', 'Rothia',
'Leptotrichia', 'Cardiobacterium', 'Capnocytophaga', 'Oribacterium',
'Alloprevotella', 'Gemella', 'Fusobacterium'],
'Gut': ['Blautia', 'Faecalibacterium', 'Bacteroides', 'Dorea', 'Akkermansia',
'Clostridium', 'Ruminococcus', 'Subdoligranulum', 'Oxalobacter',
'Oscillibacter', 'Eubacterium', 'Bilophila'],
'Skin': ['Staphylococcus', 'Peptoniphilus', 'Citrobacter', 'Enhydrobacter',
'Finegoldia', 'Propionibacterium', 'Acinetobacter', 'Massilia',
'Hymenobacter', 'Corynebacterium', 'Bacillus', 'Micrococcus'],
'Vagina': ['Mobiluncus', 'Sphingopyxis', 'Ureaplasma', 'Caulobacter', 'Gardnerella',
'Chlamydia', 'Asticcacaulis', 'Mycobacterium', 'Herbaspirillum',
'Lactobacillus', 'Achromobacter', 'Atopobium']})
tis_spec_df
After a lot of thinking, we have excluded Streptococcus from the list of oral-specific genera. The reasons were explained above, i.e. if we grep reference genomes for Streptococcus, we will end up with too many "contaminating" sequences that originate from gut-specific Streptococcus species, however get erroneously assigned to oral microbiota if one assumes that streptococcus is an oral-specific genus.
The selected microbial genera can fairly well separate Oral, Gut, Skin and Vagina samples, let us run a quick clustering based on Spearman correlation distance between the samples and using the 48 selected microbial genera.
microb_genus = microb
microb_genus.columns = [i.split('|')[5] for i in microb_genus.columns]
microb_genus.columns = [i.replace('g__','') for i in microb_genus.columns]
microb_genus
gen_list = ['Neisseria', 'Veillonella', 'Actinomyces', 'Haemophilus', 'Rothia', 'Leptotrichia', 'Cardiobacterium',
'Capnocytophaga', 'Oribacterium', 'Alloprevotella', 'Fusobacterium', 'Blautia', 'Faecalibacterium',
'Bacteroides', 'Dorea', 'Akkermansia', 'Clostridium', 'Ruminococcus', 'Subdoligranulum',
'Oxalobacter', 'Oscillibacter', 'Eubacterium', 'Bilophila', 'Staphylococcus', 'Peptoniphilus',
'Citrobacter', 'Enhydrobacter', 'Finegoldia', 'Propionibacterium', 'Acinetobacter', 'Massilia',
'Hymenobacter', 'Corynebacterium', 'Bacillus', 'Micrococcus', 'Mobiluncus', 'Sphingopyxis',
'Ureaplasma', 'Caulobacter', 'Gardnerella', 'Chlamydia', 'Asticcacaulis', 'Mycobacterium',
'Herbaspirillum', 'Lactobacillus', 'Achromobacter', 'Atopobium']
print(gen_list)
microb_genus_selected = microb_genus[gen_list]
print(microb_genus_selected.shape)
microb_genus_selected
import scipy as sp
import seaborn as sns
sns.set(font_scale = 1)
rho, _ = sp.stats.spearmanr(microb_genus_selected.T)
rho = pd.DataFrame(rho, index = microb_genus_selected.index, columns = microb_genus_selected.index)
sns.clustermap(rho, cmap = 'vlag', figsize = (20, 20))
plt.text(5, 1.2,'Clustering HMP samples by Spearman correlation distance', fontsize = 35)
plt.show()
We clearly observe 4 clusters corresponding to the Oral, Gut, Skin and Vagina samples from the HMP project. The largest cluster are the Oral samples. The second largest cluster are corresponds to the Gut HMP samples.
Now we are going to use the names of the tissue-specific genera for extracting reference genome (fasta-files) IDs from the list of microbe names. For this purpose, we will be using the match-file MapBactName2ID.txt that looks as follows:
!head MapBactName2ID.txt
The file The MapBactName2ID.txt was built by extracting headers of the reference genomes and matching their file names (GCF-ids) with the headers. In my previous post here I showed how to download the reference genomes from the NCBI RefSeq database. Briefly the procedure is described here:
#wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
#grep 'Complete Genome' assembly_summary.txt \
#> assembly_summary_complete_latest_reference_genomes.txt
#awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt \
#> assembly_summary_complete_latest_reference_genomes_paths.txt
#mkdir BacterialGenomes
#for i in $(cat assembly_summary_complete_latest_reference_genomes_paths.txt)
#do
#wget -P BacterialGenomes ${i}/*genomic.fna.gz
#done
Now we are going to use the lists of the tissue-specific genera and simply grep the records from the MapBactName2ID.txt file that contain the matching reference genome IDs. Here we display a few lines of the results of grepping of FASTA reference genome IDs for the 12 selected oral, gut, skin and vagina-specific genera.
!grep -wFf tissue_specific_genera_names_oral_12genera.txt MapBactName2ID.txt | head
!grep -wFf tissue_specific_genera_names_gut_12genera.txt MapBactName2ID.txt | head
!grep -wFf tissue_specific_genera_names_skin_12genera.txt MapBactName2ID.txt | head
!grep -wFf tissue_specific_genera_names_vagina_12genera.txt MapBactName2ID.txt | head
!grep -wFf tissue_specific_genera_names_oral_12genera.txt MapBactName2ID.txt > tissue_specific_fasta_IDs_oral.txt
!grep -wFf tissue_specific_genera_names_gut_12genera.txt MapBactName2ID.txt > tissue_specific_fasta_IDs_gut.txt
!grep -wFf tissue_specific_genera_names_skin_12genera.txt MapBactName2ID.txt > tissue_specific_fasta_IDs_skin.txt
!grep -wFf tissue_specific_genera_names_vagina_12genera.txt MapBactName2ID.txt>tissue_specific_fasta_IDs_vagina.txt
Now when we have reference genome (fasta-files) IDs corresponding to tissue-specific genera, we can slice the reference genomes and create 4 sets of sequences (for example 80 nt long) with the labels 0, 1, 2 and 3 corresponding to oral, gut, skin and vagin microbial genera. In the loop below, we first read the files with fasta IDs for each of the 4 tissues, then for each of the fasta IDs we read the fasta files and select random 1000 sequences of length 80 nt. We will also on the fly drop sequences that contain 'N' nucleotide and restrict ourselves to only 'A', 'C', 'T' and 'G' containing sequences.
import pandas as pd
fasta_df = pd.read_csv('tissue_specific_fasta_IDs_oral.txt', header = None, sep = '\t')
fasta_df
import gzip
import random
import pandas as pd
nt = 80 # length of each sequence
Nseq = 3000 # number of randomly drawn sequences from each fasta-file
random.seed(123)
all_tissues_seqs = []
all_tissues_labs = []
for lab, tissue in enumerate(['oral', 'gut', 'skin', 'vagina']):
print('Working with tissue: {} ************************************************'.format(tissue))
fasta_df = pd.read_csv('tissue_specific_fasta_IDs_' + tissue + '.txt', header = None, sep = '\t')
tissue_seqs = []
for i in range(fasta_df.shape[0]):
with gzip.open('/home/nikolay/Documents/Medium/DeepLearningMetagenomics/BacterialGenomes/'
+ fasta_df[0][i], 'r') as f:
content = f.readlines()
del content[0]
random_seqs = random.sample(content, Nseq)
random_seqs = [j.rstrip()[0:nt].decode() for j in random_seqs]
random_seqs = [k for k in random_seqs if
k.count('A') > 0 and k.count('C') > 0 and k.count('G') > 0 and k.count('T') > 0
and len(list(k)) == nt and 'N' not in k and 'H' not in k and 'K' not in k
and 'M' not in k and 'R' not in k and 'S' not in k and 'V' not in k
and 'W' not in k and 'Y' not in k and 'B' not in k]
tissue_seqs = tissue_seqs + random_seqs
if((i + 1) % 500 == 0):
print('Finished {} fasta-files'.format(i + 1))
all_tissues_seqs = all_tissues_seqs + tissue_seqs
print(all_tissues_seqs[(len(all_tissues_seqs)-3):len(all_tissues_seqs)])
all_tissues_labs = all_tissues_labs + [lab]*len(tissue_seqs)
print(all_tissues_labs[(len(all_tissues_labs)-3):len(all_tissues_labs)])
len(all_tissues_seqs)
Let us check that we indeed have only A, C, G and T nucleotides present in the sequences, i.e. no other weird characters can be found:
import numpy as np
characters = [list(i) for i in all_tissues_seqs]
set(np.ndarray.flatten(np.array(characters)))
len(all_tissues_labs)
set(all_tissues_labs)
Here we quickly check how many sequences we have that are assigned to each class:
from collections import Counter
Counter(all_tissues_labs)
Here there is a problem. We have almost 5-6 times more sequences from Skin class compared to Oral, Gut or Vagina class. Therefore, there is a risk that a CNN classifier will learn to recognize the majority class which is the Skin. We need to create an unbiased data set, for this purpose we are going to randomly draw 350 000 sequences from each class.
import pandas as pd
df = pd.DataFrame({'SEQ': all_tissues_seqs, 'LAB': all_tissues_labs})
df
N = 350000
df_all_tissues = pd.concat([df[df['LAB'] == 0].sample(N), df[df['LAB'] == 1].sample(N),
df[df['LAB'] == 2].sample(N), df[df['LAB'] == 3].sample(N)])
df_all_tissues
all_tissues_seqs = list(df_all_tissues['SEQ'])
all_tissues_labs = list(df_all_tissues['LAB'])
from collections import Counter
Counter(all_tissues_labs)
The next step is to organize the data into a format that can be passed into a deep learning algorithm. Most deep learning algorithms accept data in the form of vectors or matrices (or more generally, tensors). To get each DNA sequence in the form of a matrix, we use one-hot encoding, which encodes every base in a sequence in the form of a 4-dimensional vector, with a separate dimension for each base. We place a "1" in the dimension corresponding to the base found in the DNA sequence, and "0"s in all other slots. We then concatenate these 4-dimensional vectors together along the bases in the sequence to form a matrix.
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# The LabelEncoder encodes a sequence of bases as a sequence of integers: 0, 1, 2 and 3
integer_encoder = LabelEncoder()
# The OneHotEncoder converts an array of integers to a sparse matrix where
# each row corresponds to one possible value of each feature, i.e. only 0 and 1 are present in the matrix
one_hot_encoder = OneHotEncoder()
input_features = []
for sequence in all_tissues_seqs:
integer_encoded = integer_encoder.fit_transform(list(sequence))
integer_encoded = np.array(integer_encoded).reshape(-1, 1)
one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
input_features.append(one_hot_encoded.toarray())
np.set_printoptions(threshold = 40)
input_features = np.stack(input_features)
print("Example sequence\n-----------------------")
print('DNA Sequence #1:\n',all_tissues_seqs[0][:10],'...',all_tissues_seqs[0][-10:])
print('One hot encoding of Sequence #1:\n',input_features[0].T)
Now we will perform a similar operation with the sequence labels. While we could use the labels as a vector, it is often easier to similarly one-hot encode the labels, as we did the features.
one_hot_encoder = OneHotEncoder()
labels = np.array(all_tissues_labs).reshape(-1, 1)
input_labels = one_hot_encoder.fit_transform(labels).toarray()
print('Labels:\n',labels.T)
print('One-hot encoded labels:\n',input_labels.T)
Finally, let us split the data set into train and test sub-sets. The purpose of the test set is to ensure that we can observe the performance of the model on new data, not seen previously during training.
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(
input_features, input_labels, test_size = 0.25, random_state = 42)
train_features.shape
train_labels.shape
test_features.shape
test_labels.shape
Now everything is ready for the classification with Convolutional Neural Network (CNN). Here we choose a simple 1D convolutional neural network (CNN), which is commonly used in deep learning for functional genomics applications.
A CNN learns to recognize patterns that are generally invariant across space, by trying to match the input sequence to a number of learnable "filters" of a fixed size. In our dataset, the filters will be motifs within the DNA sequences. The CNN may then learn to combine these filters to recognize a larger structure (e.g. the presence or absence of an ancient site on a sequence). We will start with defining Convolutional Neural Network (CNN) model and summarize the fitting parameters of the model.
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout
from tensorflow.keras.models import Sequential
model = Sequential()
model.add(Conv1D(filters = 512, kernel_size = 20, input_shape = (train_features.shape[1], 4),
padding = 'same', activation = 'relu'))
#model.add(MaxPooling1D(pool_size = 2))
#model.add(Dropout(0.3))
#model.add(Conv1D(filters = 512, kernel_size = 10, padding = 'same', activation = 'relu'))
model.add(MaxPooling1D(pool_size = 2))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(256, activation = 'relu'))
model.add(Dropout(0.1))
model.add(Dense(4, activation = 'softmax'))
epochs = 30
lrate = 0.01
decay = lrate / epochs
sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
model.compile(loss = 'categorical_crossentropy', optimizer = sgd, metrics = ['accuracy'])
model.summary()
Now, we are ready to go ahead and train the neural network. We will further divide the training set into a training and validation set. We will train only on the reduced training set, but plot the loss curve on both the training and validation sets. Once the loss for the validation set stops improving or gets worse throughout the learning cycles, it is time to stop training because the model has already converged and may be just overfitting.
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import ReduceLROnPlateau
filepath = "weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor = 'val_accuracy', verbose = 1, save_best_only = True, mode = 'max')
rlrp = ReduceLROnPlateau(monitor = 'val_accuracy', factor = 0.1, patience = 5, verbose = 1, min_lr = 0.0001)
history = model.fit(train_features, train_labels, epochs = epochs, verbose = 1, validation_split = 0.25,
batch_size = 32, shuffle = True, callbacks = [rlrp, checkpoint])
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
sns.set(font_scale = 1.5)
plt.figure(figsize = (20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
plt.figure(figsize = (20,15))
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
Here we will save the final model just in case although the best (in terms of validation accuracy) model was also saved automatically when training into the weights.best.hdf5 file.
model.save('DeepLearningMicrobiome.h5')
The best way to evaluate whether the network has learned to classify sequences is to evaluate its performance on a fresh test set consisting of data that it has not observed at all during training. Here, we evaluate the model on the test set and plot the results as a confusion matrix.
from tensorflow import keras
model = keras.models.load_model('weights.best.hdf5')
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
sns.set(font_scale = 1.5)
from sklearn.metrics import confusion_matrix
import itertools
plt.figure(figsize = (10,8))
predicted_labels = model.predict(np.stack(test_features))
cm = confusion_matrix(np.argmax(test_labels, axis = 1), np.argmax(predicted_labels, axis = 1))
print('Confusion matrix:\n',cm)
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
plt.imshow(cm, cmap = plt.cm.Blues)
plt.title('Normalized confusion matrix')
plt.colorbar()
plt.xlabel('True label')
plt.ylabel('Predicted label')
plt.xticks([0, 1, 2, 3]); plt.yticks([0, 1, 2, 3])
plt.grid('off')
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], '.2f'),
horizontalalignment = 'center',
color = 'white' if cm[i, j] > 0.5 else 'black')
plt.show()
The raw confusion matrix is printed out as a numpy array above. The graphical representation of the confusion matrix is the blueish heatmap that displays normalized accuracies per class. This is handy as one can immediately see that Oral and Skin microbial communities have lower accuracies of classfication compared to Gut and Vagina. This can be intiitively understood from the PCA plot in the beginning of this notebook, where Oral and Skin communities were quite mixed and less distinct from each other compared to Gut and Vagina clusters that lie separately from the rest. As an average of the blueish normalized confusion matrix above, one can use again the test data set (not used during training) and get an average evaluation accuracy which is almost 60%.
scores = model.evaluate(test_features, test_labels, verbose = 0)
print("Accuracy: %.2f%%" % (scores[1]*100))
The overall average evaluation accuracy of 60% might seem low at firt glance because our minds are set on a binary classification problem where a random classifier could give a basal accuracy of 50%. Therefore our result of 60% might seem just a little bit better than a random classifier. However, please recall that we are dealing here with 4 classes, and therefore a random classifier should give us an average accuracy of 25% (can be confirmed by simulations) with a standard deviation around 5-10% depending on the sample size. Therefore, the accuracy of 60% is significantly higher than the basal 25% that one would expect from a random classifier.
Now, when we have trained a classifier to assign a microbial DNA sequence to an Oral, Gut, Skin or Vagina class, let us try to chek whether it works well and assign an individual sequence to one of the 4 classes / tissues. Here we pick a random sequence (with index 100) from the test data set.
from tensorflow import keras
model = keras.models.load_model('weights.best_acc60_12genera.hdf5')
test_features[100,:,:]
We check the true class of the sequence which appears to be Oral, i.e. the 0 class:
test_labels[100,:]
Now we are making class prediction for the selected sequence:
model.predict(test_features[100,:,:][np.newaxis,:,:])
model.predict_classes(test_features[100,:,:][np.newaxis,:,:])
Here we see that the sequence with index 100 can be correctly assigned to belong to Oral bacteria since it gets the highest probability 42% for the Oral class, although Skin was also highly ranked, it got probability 38%. In the future, when we make predictions for a sample by averaging predictions for each sequence, we should probably only use sequences with most confident predictions, such as one class should have a probability >80%.
Here we are going to take a random FASTQ-file (from HMP or DIABIMMUNE), make predictions for each sequence in the FASTQ-file, and present fractions of microbes (microbial composition) for the sample. Let us start with a Stool sample from the HMP project. We will read the sequences from the FASTQ-file:
from Bio import SeqIO
test_seq_list = []
fastq_path = '/home/nikolay/Documents/Medium/DeepLearningMetagenomics/'
#fastq_file = 'HMP_Oral_SRS011243.denovo_duplicates_marked.trimmed.singleton.fastq'
#fastq_file = 'HMP_Oral_SRS055495.denovo_duplicates_marked.trimmed.singleton.fastq'
#fastq_file = 'HMP_Gut_SRS017247.denovo_duplicates_marked.trimmed.singleton.fastq'
fastq_file = 'HMP_Gut_SRS022713.denovo_duplicates_marked.trimmed.singleton.fastq'
#fastq_file = 'HMP_Skin_SRS017156.denovo_duplicates_marked.trimmed.singleton.fastq'
#fastq_file = 'HMP_Skin_SRS018369.denovo_duplicates_marked.trimmed.singleton.fastq'
#fastq_file = 'HMP_Vagina_SRS015071.denovo_duplicates_marked.trimmed.singleton.fastq'
#fastq_file = 'HMP_Vagina_SRS062752.denovo_duplicates_marked.trimmed.singleton.fastq'
#fastq_file = 'ERR3827177.fastq'
#fastq_file = 'ERR3827196.fastq'
#fastq_file = 'G69153_pe_1.fastq'
#fastq_file = '4491200149-1L1_S12_L001_R1_001.fastq'
#fastq_file = '4491200209-1L1_S66_L001_R1_001.fastq'
#fastq_file = 'SRR6900017_1.fastq'
##fastq_file = 'SRR6899944_1.fastq'
#fastq_file = 'SRR6899929_1.fastq'
#for record in SeqIO.parse(fastq_path + fastq_file, "fastq"):
# if record.id in ids_no_human and len(str(record.seq)) >= 80:
# test_seq_list.append(str(record.seq))
for record in SeqIO.parse(fastq_path + fastq_file, "fastq"):
if len(str(record.seq)) >= 80:
test_seq_list.append(str(record.seq))
len(test_seq_list)
Since the sample is quite big, we will subsample random 100 000 sequences for classification. This is done to speed up the prediction.
import random
test_seq_list = random.sample(test_seq_list, 100000)
len(test_seq_list)
Next, we will pre-process the data in the same way as the train data, i.e. we will trim all the sequences to 80 nt long DNA stretches, and we will exclude sequences that contain weird nucleitide, i.e. only A, C, T or G nucleotide are allowed to be present in the test data set.
nt = 80
test_seq_list = [j.rstrip()[0:nt] for j in test_seq_list]
test_seq_list = [k for k in test_seq_list if
k.count('A') > 0 and k.count('C') > 0 and k.count('G') > 0 and k.count('T') > 0
and len(list(k)) == nt and 'N' not in k and 'H' not in k and 'K' not in k
and 'M' not in k and 'R' not in k and 'S' not in k and 'V' not in k
and 'W' not in k and 'Y' not in k and 'B' not in k]
len(test_seq_list)
test_seq_list[0]
Finally, we will need to one-hot encode the test data set in exactly the same way as we did for the train data set.
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
integer_encoder = LabelEncoder()
one_hot_encoder = OneHotEncoder()
input_test_features = []
for sequence in test_seq_list:
integer_encoded = integer_encoder.fit_transform(list(sequence))
integer_encoded = np.array(integer_encoded).reshape(-1, 1)
one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
input_test_features.append(one_hot_encoded.toarray())
np.set_printoptions(threshold = 40)
input_test_features = np.stack(input_test_features)
print("Example sequence\n-----------------------")
print('DNA Sequence #1:\n', test_seq_list[0][:10],'...', test_seq_list[0][-10:])
print('One hot encoding of Sequence #1:\n',input_test_features[0].T)
Now we will make a prediction for a random sequence with index 10 that was previously one-hot encoded.
input_test_features[10,:,:]
from tensorflow import keras
model = keras.models.load_model('weights.best_acc60_12genera.hdf5')
import warnings
warnings.filterwarnings("ignore")
model.predict_classes(input_test_features[10,:,:][np.newaxis,:,:])
We see that the classifier correctly predicted the sequence number 10 to belong to class 1 which is Gut, this makes sense since we are making predictions for a Stool sample from HMP. This was just one sequence. Now we are going to make predictions for all sequences from the HMP Stool sample and summarize the predictions by counting how many sequences were assigned to each class using the Counter function from collections module.
predicted_test_labels = model.predict_classes(np.stack(input_test_features))
predicted_test_labels
from collections import Counter
Counter(list(predicted_test_labels))
Here we see that the majority of sequences were correctly assigned to class 1 which is Gut. Let us convert these numbers into fractions / contributions of each class to the Stool HMP sample.
microbial_fractions = [Counter(list(predicted_test_labels))[0],
Counter(list(predicted_test_labels))[1],
Counter(list(predicted_test_labels))[2],
Counter(list(predicted_test_labels))[3]]
microbial_fractions = [i / sum(microbial_fractions) for i in microbial_fractions]
microbial_fractions
Here we can see that the Gut class got the highest fraction of 44% followed by Skin that got 31% of sequences assigned to this calss. Therefore we conclude that the sample is most likely from Gut / Stool, and this agrees with what we know about the sample as SRS022713 sample was annotated as "Stool" in the HMP project.
If we now want to consider only most reliably classified sequences, we can filter the prediction results and keep only the sequences were one class exceeds a certain probability threshold such as 80%. But first we need to get probabilities for each of the 4 classes for all sequences from the HMP Stool sample.
predicted_test_probs = model.predict(np.stack(input_test_features))
predicted_test_probs
cutoff = 0.8
Counter([np.argmax(j) for j in [i for i in predicted_test_probs if max(list(i)) > cutoff]])
microbial_fractions=[Counter([np.argmax(j) for j in [i for i in predicted_test_probs if max(list(i))>cutoff]])[0],
Counter([np.argmax(j) for j in [i for i in predicted_test_probs if max(list(i))>cutoff]])[1],
Counter([np.argmax(j) for j in [i for i in predicted_test_probs if max(list(i))>cutoff]])[2],
Counter([np.argmax(j) for j in [i for i in predicted_test_probs if max(list(i))>cutoff]])[3]]
microbial_fractions = [i / sum(microbial_fractions) for i in microbial_fractions]
microbial_fractions
Thus when we selected only sequences with most reliable predictions (where one class got more than 80% probability), we obtained 67% of microbial sequences assigned to the Gut class. Therefore, selecting most reliable sequences seems to make sense, however restricting the number of sequences we should still have a fair number so that it is still possible to do some statistics on them.
sum(Counter([np.argmax(j) for j in [i for i in predicted_test_probs if max(list(i)) > cutoff]]).values())
As we see, we still have enough sequences, i.e. over 2000, in order to realiably compute fractions of sequences assigned to each class.
Once we know how to make predictions of microbial composition for one sample, we will run predictions on 8 random samples from the HMP project. We will select 2 samples from each class, i.e. 2 Oral, 2 Gut, 2 Skin and 2 Vagina samples. Let us wrap the prediction procedure up as a loop over the samples:
from tensorflow import keras
model = keras.models.load_model('weights.best_acc60_12genera.hdf5')
import numpy as np
from Bio import SeqIO
from tensorflow import keras
from collections import Counter
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
tissue_list = ['Oral', 'Oral', 'Gut', 'Gut', 'Skin', 'Skin', 'Vagina', 'Vagina']
sample_list = ['SRS011243', 'SRS055495', 'SRS017247', 'SRS022713', 'SRS017156', 'SRS018369',
'SRS062752', 'SRS015071']
nt = 80
cutoff = 0.8
oral_list = []; gut_list = []; skin_list = []; vagina_list = []
fastq_path = '/home/nikolay/Documents/Medium/DeepLearningMetagenomics/'
for j in range(len(sample_list)):
print('Working with sample ' + str(sample_list[j]) + ' from tissue ' + str(tissue_list[j]))
test_seq_list = []
fastq_file = 'HMP_' + tissue_list[j] + '_' + sample_list[j] + \
'.denovo_duplicates_marked.trimmed.singleton.fastq'
for record in SeqIO.parse(fastq_path + fastq_file, "fastq"):
if len(str(record.seq)) >= 80:
test_seq_list.append(str(record.seq))
test_seq_list = [j.rstrip()[0:nt] for j in test_seq_list]
test_seq_list = [k for k in test_seq_list if
k.count('A') > 0 and k.count('C') > 0 and k.count('G') > 0 and k.count('T') > 0
and len(list(k)) == nt and 'N' not in k and 'H' not in k and 'K' not in k
and 'M' not in k and 'R' not in k and 'S' not in k and 'V' not in k
and 'W' not in k and 'Y' not in k and 'B' not in k]
integer_encoder = LabelEncoder()
one_hot_encoder = OneHotEncoder()
input_test_features = []
for sequence in test_seq_list:
integer_encoded = integer_encoder.fit_transform(list(sequence))
integer_encoded = np.array(integer_encoded).reshape(-1, 1)
one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
input_test_features.append(one_hot_encoded.toarray())
np.set_printoptions(threshold = 40)
input_test_features = np.stack(input_test_features)
#predicted_test_labels = model.predict_classes(np.stack(input_test_features))
#microbial_fractions = [Counter(list(predicted_test_labels))[0],
# Counter(list(predicted_test_labels))[1],
# Counter(list(predicted_test_labels))[2],
# Counter(list(predicted_test_labels))[3]]
predicted_test_probs = model.predict(np.stack(input_test_features))
microbial_fractions=\
[Counter([np.argmax(j) for j in [i for i in predicted_test_probs if max(list(i)) > cutoff]])[0],
Counter([np.argmax(j) for j in [i for i in predicted_test_probs if max(list(i)) > cutoff]])[1],
Counter([np.argmax(j) for j in [i for i in predicted_test_probs if max(list(i)) > cutoff]])[2],
Counter([np.argmax(j) for j in [i for i in predicted_test_probs if max(list(i)) > cutoff]])[3]]
microbial_fractions = [i / sum(microbial_fractions) for i in microbial_fractions]
print(microbial_fractions)
oral_list.append(microbial_fractions[0])
gut_list.append(microbial_fractions[1])
skin_list.append(microbial_fractions[2])
vagina_list.append(microbial_fractions[3])
Let us summarize the fractions of each class for the 8 random HMP samples as a table and plot them as a stacked barplot.
import pandas as pd
df = pd.DataFrame({'Sample': [i + '_' + j for i, j in zip(tissue_list, sample_list)],
'Oral': oral_list, 'Gut': gut_list, 'Skin': skin_list, 'Vagina': vagina_list})
df
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
sns.set(font_scale = 1.8)
df.plot(x = 'Sample', kind = 'bar', stacked = True, title = 'Microbial composition of HMP samples',
figsize = (20,15))
plt.legend(loc = 'upper left', prop = {'size': 20})
plt.show()
We can see that we get fairly good predictions for each random HMP sample, i.e. first two looks rather like Oral samples since Oral microbes (the blue bar) comprise the most of the microbial community, while the second two look like Gut samples etc.
Above, we trained a CNN classifier using reference genomes of microbes abundant in different tissues of HMP project. We proceeded with evaluating the model on a few random HMP samples. Here might be a problem. On the one hand, we did not use raw sequences from the HMP samples when training the CNN classifier, but instead we used the reference genomes of most abundant genera. On the other hand, there is still an information leakage because we still used the same project (althoug different types of data) for both training and evaluating. Therefore the evaluation might till be biased and look too good to be true. To make a better evaluation, we will select 8 random Oral, Gut, Skin and Vagina samples from random metagenomics projects that are not related to the HMP.
Below, we will make predictions for a few samples other than HMP samples, i.e. the DIABIMMUNE, files, the WGS Oral Microbiome, here are the files and the publication, projects. Skin samples were downloaded from the MGnify resource provided by the EBI from the project "Patterns in the skin microbiota differ in children and teenagers between rural and urban environments", the files are here. Vaginal swabs were sequenced in the project MOMS-PI, the data are available from here.
We will apply the same loop over the samples as we used above for the HMP samples with only one difference. We need to always control that after applying a probability threshold (here 90%) we still get enough sequences to compute the fractions of microbial classes. If we do not have 500 sequences (this is true for some of the random non-HMP samples due to their low sequencing depth), the threshold will be lowered from 0.9 down to 0.6.
from tensorflow import keras
model = keras.models.load_model('weights.best_acc60_12genera.hdf5')
import random
import numpy as np
from Bio import SeqIO
from tensorflow import keras
from collections import Counter
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
random.seed(123)
import warnings
warnings.filterwarnings("ignore")
tissue_list = ['Oral', 'Oral', 'Gut', 'Gut', 'Skin', 'Skin', 'Vagina', 'Vagina']
sample_list = ['ERR3827177.fastq', 'ERR3827196.fastq', 'G69146_pe_1.fastq', 'G69153_pe_1.fastq',
'4491200149-1L1_S12_L001_R1_001.fastq', '4491200209-1L1_S66_L001_R1_001.fastq',
'SRR6900017_1.fastq', 'SRR6899929_1.fastq']
nt = 80
cutoff = 0.9
oral_list = []; gut_list = []; skin_list = []; vagina_list = []
fastq_path = '/home/nikolay/Documents/Medium/DeepLearningMetagenomics/'
for j in range(len(sample_list)):
print('Working with sample ' + str(sample_list[j]) + ' from tissue ' + str(tissue_list[j]))
test_seq_list = []
fastq_file = sample_list[j]
for record in SeqIO.parse(fastq_path + fastq_file, "fastq"):
if len(str(record.seq)) >= 80:
test_seq_list.append(str(record.seq))
test_seq_list = random.sample(test_seq_list, 100000)
test_seq_list = [j.rstrip()[0:nt] for j in test_seq_list]
test_seq_list = [k for k in test_seq_list if
k.count('A') > 0 and k.count('C') > 0 and k.count('G') > 0 and k.count('T') > 0
and len(list(k)) == nt and 'N' not in k and 'H' not in k and 'K' not in k
and 'M' not in k and 'R' not in k and 'S' not in k and 'V' not in k
and 'W' not in k and 'Y' not in k and 'B' not in k]
integer_encoder = LabelEncoder()
one_hot_encoder = OneHotEncoder()
input_test_features = []
for sequence in test_seq_list:
integer_encoded = integer_encoder.fit_transform(list(sequence))
integer_encoded = np.array(integer_encoded).reshape(-1, 1)
one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
input_test_features.append(one_hot_encoded.toarray())
np.set_printoptions(threshold = 40)
input_test_features = np.stack(input_test_features)
#predicted_test_labels = model.predict_classes(np.stack(input_test_features))
#microbial_fractions = [Counter(list(predicted_test_labels))[0],
# Counter(list(predicted_test_labels))[1],
# Counter(list(predicted_test_labels))[2],
# Counter(list(predicted_test_labels))[3]]
predicted_test_probs = model.predict(np.stack(input_test_features))
if sum(Counter([np.argmax(j) for j in [i for i in predicted_test_probs \
if max(list(i)) > cutoff]]).values()) < 500:
cutoff = 0.6
microbial_fractions=\
[Counter([np.argmax(j) for j in [i for i in predicted_test_probs if max(list(i)) > cutoff]])[0],
Counter([np.argmax(j) for j in [i for i in predicted_test_probs if max(list(i)) > cutoff]])[1],
Counter([np.argmax(j) for j in [i for i in predicted_test_probs if max(list(i)) > cutoff]])[2],
Counter([np.argmax(j) for j in [i for i in predicted_test_probs if max(list(i)) > cutoff]])[3]]
microbial_fractions = [i / sum(microbial_fractions) for i in microbial_fractions]
print('Probability cutoff: {}'.format(cutoff))
print(microbial_fractions)
oral_list.append(microbial_fractions[0])
gut_list.append(microbial_fractions[1])
skin_list.append(microbial_fractions[2])
vagina_list.append(microbial_fractions[3])
import pandas as pd
df = pd.DataFrame({
'Sample':
[i +'_'+j
for i,j in
zip(tissue_list,
[k.replace('.fastq','').replace('_1','').replace('_001','').replace('_R1','').replace('_L001','')
for k in sample_list])], 'Oral': oral_list, 'Gut': gut_list, 'Skin': skin_list, 'Vagina': vagina_list})
df
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
sns.set(font_scale = 1.8)
df.plot(x = 'Sample', kind = 'bar', stacked = True,
title = 'Microbial composition of random samples from different metagenomics projects', figsize = (20,15))
plt.legend(loc = 'upper left', prop = {'size': 20})
plt.show()
Here we again see that the predictions of microbial compositions in the 8 non-HMP samples fairly agree with the true Oral, Gut, Skin and Vaginal microbial communities that the samples were taken from. Well done, our trained CNN classifier can provide meaningful estimates of microbial composition of any given sample.
Here we will save some handy codes that helped us in the analysis above. Later for the final version of the notebook, they should be removed.
[print(i) for i in microb.columns if 'Clostridium_perfringens' in i]
genus='k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiaceae|g__Clostridium|s__Clostridium_perfringens'
microb.columns.get_loc(genus)
ind = 1645
my_data = pd.DataFrame({'Tissue': list(['Oral']*phen['STArea'].value_counts()['Oral'] +
['Skin']*phen['STArea'].value_counts()['Skin'] +
['Vaginal']*phen['STArea'].value_counts()['Vaginal'] +
['Gut']*phen['STArea'].value_counts()['Gut']),
'Abundance': list(microb.loc[phen['ID'][phen['STArea'] == 'Oral'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Skin'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Vaginal'],].iloc[:, ind]) +
list(microb.loc[phen['ID'][phen['STArea'] == 'Gut'],].iloc[:, ind])})
my_data.groupby('Tissue', sort = False).mean().sort_values('Abundance', ascending = True)
0.072794/0.000257